Abstract

We specify an algorithm that builds up a hi- 
erarchy of referential discourse segments from 
local centering data. The spatial extension and 
nesting of these discourse segments constrain 
the reachability of potential antecedents of an 
anaphoric expression beyond the local level 
of adjacent center pairs. Thus, the centering 
model is scaled up to the level of the global 
referential structure of discourse. An empiri- 
cal evaluation of the algorithm is supplied. 

1 Introduction 

The centering model (Grosz et al., 1995) has evolved as
a major methodology for computational discourse analysis.
It provides simple, yet powerful data structures, constraints
and rules for the local coherence of discourse. As far as
anaphora resolution is concerned, for example, the model
requires that the potential antecedents for anaphoric
expressions in the current utterance U_i be those discourse
entities which are available in the forward-looking centers
of the immediately preceding utterance U_{i-1}. No constraints
or rules are formulated, however, that account for anaphoric
relationships which spread out over non-adjacent utterances.
Hence, it is unclear how discourse elements which appear in
utterances preceding U_{i-1} are taken into consideration as
potential antecedents for anaphoric expressions in U_i.

The extension of the search space for antecedents is
by no means a trivial enterprise. A simple linear backward
search of all preceding centering structures, for example,
may not only establish illegal references but also contradict
the cognitive principles underlying the limited attention
constraint (Walker, 1996b). The solution we propose starts
from the observation that additional constraints on valid
antecedents are placed by the global discourse structure in
which previous utterances are embedded. We want to emphasize
from the beginning that our proposal considers only the
referential properties underlying the global discourse
structure. Accordingly, we define the extension of referential
discourse segments (over several utterances) and a hierarchy
of referential discourse segments (structuring the entire
discourse).^1 The algorithmic procedure we propose for
creating and managing such segments receives local centering
data as input and generates a sort of superimposed index
structure by which the reachability of potential antecedents,
in particular those prior to the immediately preceding
utterance, is made explicit. The adequacy of this definition
is judged by the effects centered discourse segmentation
has on the validity of anaphora resolution (cf. Section 5
for a discussion of evaluation results).

2 Global Discourse Structure 

There have been only a few attempts at dealing with the
recognition and incorporation of discourse structure beyond
the level of immediately adjacent utterances within the
centering framework. Two recent studies deal with this topic
in order to relate attentional and intentional structures on
a larger scale of global discourse coherence. Passonneau
(1996) proposes an algorithm for the generation of referring
expressions and Walker (1996a) integrates centering into a
cache model of attentional state. Both studies, among other
things, examine whether a correlation exists between
particular centering transitions (which were first introduced
by Brennan et al. (1987); cf. Table 1) and intention-based
discourse segments. In particular, the role of SHIFT-type
transitions is examined from the perspective of whether they
not only indicate a shift of the topic between two immediately
successive utterances but also signal (intention-based)
segment boundaries. The data in both studies reveal that only
a weak correlation between the SHIFT transitions and segment
boundaries can be observed. This finding precludes a reliable
prediction of segment boundaries based on the occurrence of
SHIFTs and vice versa. In order to accommodate these empirical
results, divergent solutions are proposed. Passonneau suggests
that the centering data structures need to be modified
appropriately, while Walker concludes that the local centering
data should be left as they are and further be complemented by
a cache mechanism. She thus intends to extend the scope of
centering in accordance with cognitively plausible limits of
the attentional span. Walker, finally, claims that the content
of the cache, rather than the intentional discourse segment
structure, determines the accessibility of discourse entities
for anaphora resolution.

^1 Our notion of referential discourse segment should not be
confused with the intentional one originating from Grosz &
Sidner (1986), for reasons discussed in Section 2.





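                    | Cb(U_n) = Cb(U_{n-1})     | Cb(U_n) ≠ Cb(U_{n-1})
                    | OR Cb(U_{n-1}) undef.     |
--------------------+---------------------------+----------------------
Cb(U_n) = Cp(U_n)   | CONTINUE (C)              | SMOOTH-SHIFT (SS)
Cb(U_n) ≠ Cp(U_n)   | RETAIN (R)                | ROUGH-SHIFT (RS)

Table 1: Transition Types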
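
The classification in Table 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name classify_transition and the use of None for an undefined Cb are choices made here, and returning None for an utterance without a Cb is a convention of this sketch rather than part of the table.

```python
from typing import Optional


def classify_transition(cb_prev: Optional[str],
                        cb_cur: Optional[str],
                        cp_cur: str) -> Optional[str]:
    """Return the Table 1 transition for U_n.

    cb_prev: Cb(U_{n-1}), or None if undefined
    cb_cur:  Cb(U_n), or None if undefined
    cp_cur:  Cp(U_n)
    """
    if cb_cur is None:
        # Assumption of this sketch: no Cb, hence no transition label.
        return None
    # Column of Table 1: Cb unchanged, or previous Cb undefined.
    same_cb = cb_prev is None or cb_cur == cb_prev
    # Row of Table 1: does the Cb coincide with the preferred center?
    if cb_cur == cp_cur:
        return "CONTINUE" if same_cb else "SMOOTH-SHIFT"
    return "RETAIN" if same_cb else "ROUGH-SHIFT"
```

For instance, an utterance whose Cb differs from the previous Cb but equals its own Cp is labelled SMOOTH-SHIFT, whereas one whose Cb differs from both is labelled ROUGH-SHIFT.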

As a working hypothesis, for the purposes of anaphora
resolution we subscribe to Walker's model, in particular
to that part which casts doubt on the hypothesized dependency
of the attentional on the intentional structure of discourse
(Grosz & Sidner, 1986, p.180). We diverge from Walker (1996a),
however, in that we propose an alternative to the caching
mechanism, which we consider to be methodologically more
parsimonious and at least equally effective (for an
elaboration of this claim, cf. Section 6).

The proposed extension of the centering model builds
on the methodological framework of functional centering
(Strube & Hahn, 1996). This is an approach to centering in
which issues such as thematicity or topicality are already
inherent. Its linguistic foundations relate the ranking of
the forward-looking centers and the functional information
structure of the utterances, a notion originally developed
by Daneš (1974). Strube & Hahn (1996) use the centering data
structures to redefine Daneš's trichotomy between given
information, theme and rheme in terms of the centering model.
The Cb(U_n), the most highly ranked element of Cf(U_{n-1})
realized in U_n, corresponds to the element which represents
the given information. The theme of U_n is represented by the
preferred center Cp(U_n), the most highly ranked element of
Cf(U_n). The theme/rheme hierarchy of U_n corresponds to the
ranking in the Cfs. As a consequence, utterances without any
anaphoric expression do not have any given elements and,
therefore, no Cb. But independent of the use of anaphoric
expressions, each utterance must have a theme and a Cf as well.

The identification of the preferred center with the
theme implies that it is of major relevance for determining
the thematic progression of a text. This is reflected in
our reformulation of the two types of thematic progression
(TP) which can be directly derived from centering data (the
third one requires reference to conceptual generalization
hierarchies and is therefore beyond the scope of this paper,
cf. Daneš (1974) for the original statement):

1. TP with a constant theme: Successive utterances
continuously share the same Cp.

2. TP with linear thematization of rhemes: An element
of the Cf(U_{i-1}) which is not the Cp(U_{i-1}) appears
in U_i and becomes the Cp(U_i) after the processing
of this utterance.



Upper box (constant theme):

    Cf(U_{i-1}):  [ c1, cj, ..., cs ]
                    |
    Cf(U_i):      [ c1, ck, ..., ct ]

Lower box (linear thematization of rhemes):

    Cf(U_{i-1}):  [ c1, cj, ..., cs ]     1 < j ≤ s
                      /
    Cf(U_i):      [ cj, ck, ..., ct ]

Table 2: Thematic Progression Patterns

Table 2 visualizes the abstract schemata of TP patterns.
In our example (cf. Table 8 in Section 4), U1 to U3
illustrate the constant theme, while U7 to U10 illustrate
the linear thematization of rhemes. In the latter case,
the theme changes in each utterance, from "Handbuch"
(manual) via "Inhaltsverzeichnis" (table of contents) to
"Kapitel" (chapter) etc. Each of the new themes is
introduced in the immediately preceding utterance so that
local coherence between these utterances is established.
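
The two TP patterns can be checked mechanically on a pair of adjacent Cf lists. The following Python sketch is a hedged illustration: the function name thematic_progression and the string labels are assumptions made here, and each Cf list is taken to be non-empty with the preferred center in first position.

```python
from typing import List, Optional


def thematic_progression(cf_prev: List[str], cf_cur: List[str]) -> Optional[str]:
    """Classify the step from U_{i-1} to U_i by its TP pattern.

    Both arguments are ranked Cf lists; element 0 is the Cp.
    Returns None when neither of the two patterns applies.
    """
    cp_prev, cp_cur = cf_prev[0], cf_cur[0]
    if cp_cur == cp_prev:
        # Pattern 1: successive utterances share the same Cp.
        return "constant theme"
    if cp_cur in cf_prev[1:]:
        # Pattern 2: a non-preferred element of Cf(U_{i-1}) becomes Cp(U_i).
        return "linear thematization of rhemes"
    return None


# Cf lists of U7 and U8 as given in the sample analysis (Table 8).
CF_U7 = ["Handbuch", "Seite", "1260", "HP-Modus",
         "Inhaltsverzeichnis", "Informationen"]
CF_U8 = ["Inhaltsverzeichnis", "Hinweis", "Seiten", "Kapitel",
         "Diskette", "Frechheit"]
```

With these lists, the step from U7 to U8 falls under the second pattern, since "Inhaltsverzeichnis" is a non-preferred element of Cf(U7) and the preferred center of U8.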

Daneš (1974) also allows for the combination and recursion
of these basic patterns; this way the global thematic
coherence of a text can be described by recourse to these
structural patterns. These principles allow for a major
extension of the original centering algorithm. Given a
reformulation of the TP constraints in centering terms, it is
possible to determine referential segment boundaries and to
arrange these segments in a nested, i.e., hierarchical manner
on the basis of which reachability constraints for antecedents
can be formulated. According to the segmentation strategy of
our approach, the Cp of the end point (i.e., the last
utterance) of a discourse segment provides the major theme of
the whole segment, one which is particularly salient for
anaphoric reference relations. Whenever a relevant new theme
is established, however, it should reside in its own discourse
segment, either embedded in or parallel to another one.
Anaphora resolution can then be performed (a) with the
forward-looking centers of the linearly immediately preceding
utterance, (b) with the forward-looking centers of the end
point of the hierarchically immediately reachable discourse
segment, and (c) with the preferred center of the end point of
any hierarchically reachable discourse segment (for a
formalization of this constraint, cf. Table 4).



3 Computing Global Discourse Structure 

Prior to a discussion of the algorithmic procedure for
hypothesizing discourse segments based on evidence from
local centering data, we will introduce its basic building
blocks. Let x denote the anaphoric expression under
consideration, which occurs in utterance U_i associated
with segment level s. The function Resolved(x, s, U_i)
(cf. Table 3) is evaluated in order to determine the proper
antecedent ante for x. It consists of the evaluation of
a reachability predicate for the antecedent, on which we
will concentrate here, and of the evaluation of the predicate
IsAnaphorFor, which contains the linguistic and conceptual
constraints imposed on a (pro)nominal anaphor (viz.
agreement, binding, and sortal constraints) or a textual
ellipsis (Hahn et al., 1996), not an issue in this paper.
The predicate IsReachable (cf. Table 4) requires ante to
be reachable from the utterance U_i associated with the
segment level s.^2 Reachability is thus made dependent
on the segment structure DS of the discourse as built
up by the segmentation algorithm which is specified in
Table 6. In Table 4, the symbol "=str" denotes string
equality, N the natural numbers. We also introduce as a
notational convention that a discourse segment is identified
by its index s and its opening and closing utterance,
viz. DS[s.beg] and DS[s.end], respectively. Hence, we
may either identify an utterance U_i by its linear text
index, i, or, if it is accessible, with respect to its
hierarchical discourse segment index, s (e.g., cf. Table 8
where U3 = U_DS[1.end] or U13 = U_DS[3.end]). The discourse
segment index is always identical to the currently valid
segment level, since the algorithm in Table 6 implements
a stack behavior. Note also that we attach the discourse
segment index s to center expressions, e.g., Cb(s, U_i).



Resolved(x, s, U_i) :=
    ante     if IsReachable(ante, s, U_i) ∧ IsAnaphorFor(x, ante)
    undef    else

Table 3: Resolution of Anaphora

IsReachable(ante, s, U_i)
    if       ante ∈ Cf(s, U_{i-1})
    else if  ante ∈ Cf(s-1, U_DS[s-1.end])
    else if  (∃v ∈ N : ante =str Cp(v, U_DS[v.end]) ∧ v < (s-1))
             ∧ (¬∃v' ∈ N : ante =str Cp(v', U_DS[v'.end]) ∧ v < v')

Table 4: Reachability of the Anaphoric Antecedent
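
The three clauses of Table 4 can be read as a cascade of membership tests. The Python sketch below is a simplified illustration, not the authors' implementation: the names is_reachable, cf and seg_end are hypothetical, cf maps an utterance index to its ranked Cf list (Cp first), and seg_end maps a segment level to the index of the utterance at that segment's end point. The second conjunct of the third clause only selects the most deeply embedded matching level, so it does not change the truth value computed here.

```python
from typing import Dict, List


def is_reachable(ante: str, s: int, i: int,
                 cf: Dict[int, List[str]],
                 seg_end: Dict[int, int]) -> bool:
    """Reachability of ante from U_i at segment level s (after Table 4)."""
    # Clause 1: ante is a forward-looking center of U_{i-1}.
    if ante in cf[i - 1]:
        return True
    # Clause 2: ante is in the Cf of the end point of the segment at level s-1.
    if s > 1 and ante in cf[seg_end[s - 1]]:
        return True
    # Clause 3: ante is the Cp of the end point of some level v < s-1.
    return any(cf[seg_end[v]][0] == ante for v in range(1, s - 1))


# Configuration of the sample text when U12 is processed (cf. Table 8):
# level 1 ends at U3, level 2 at U7, level 3 at U11.
CF = {
    3: ["1260", "Betrieb", "Arbeitsgeräusch", "Stand-by-Modus"],
    7: ["Handbuch", "Seite", "1260", "HP-Modus",
        "Inhaltsverzeichnis", "Informationen"],
    11: ["Auto-Continue-Funktion", "Dateien-Transfer", "Computer"],
}
SEG_END = {1: 3, 2: 7, 3: 11}
```

In this configuration "1260" is reachable from U12 through the second clause (it is an element of Cf(2, U7)), while "Betrieb" is not reachable, because only the preferred center of the level-1 end point U3 is consulted.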

Finally, the function Lift(s, i) (cf. Table 5) determines
the appropriate discourse segment level, s, of an utterance
U_i (selected by its linear text index, i). Lift only
applies to structural configurations in the centering lists
in which themes continuously shift across at least three
consecutive segment levels and their associated preferred
centers (cf. Table 2, lower box, for the basic pattern).

^2 The Cf lists in the functional centering model are totally
ordered (Strube & Hahn, 1996, p.272) and we here implicitly
assume that they are accessed in the total order given.



Lift(s, i) :=
    Lift(s-1, i-1)   if  s > 2 ∧ i > 3
                         ∧ Cp(s, U_{i-1}) ≠ Cp(s-1, U_{i-2})
                         ∧ Cp(s-1, U_{i-2}) ≠ Cp(s-2, U_{i-3})
                         ∧ Cp(s, U_{i-1}) ∈ Cf(s-1, U_{i-2})
    s                else

Table 5: Lifting to the Appropriate Discourse Segment
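
The recursion in Table 5 translates almost literally into code. The following Python sketch rests on one simplifying assumption: since the segment index attached to a center expression is only an annotation, Cp and Cf are looked up by utterance index alone. The names lift and CF are hypothetical, and the Cf lists are those of U3 to U10 as given in the sample analysis (Table 8).

```python
from typing import Dict, List


def lift(s: int, i: int, cf: Dict[int, List[str]]) -> int:
    """Segment level for U_i after lifting (after Table 5).

    cf maps an utterance index to its ranked Cf list; element 0 is the Cp.
    """
    if (s > 2 and i > 3
            and cf[i - 1][0] != cf[i - 2][0]       # Cp(U_{i-1}) != Cp(U_{i-2})
            and cf[i - 2][0] != cf[i - 3][0]       # Cp(U_{i-2}) != Cp(U_{i-3})
            and cf[i - 1][0] in cf[i - 2]):        # Cp(U_{i-1}) in Cf(U_{i-2})
        return lift(s - 1, i - 1, cf)
    return s


CF = {
    3: ["1260", "Betrieb", "Arbeitsgeräusch", "Stand-by-Modus"],
    4: ["Standard-Installation", "Handbuch"],
    5: ["Handbuch", "1260", "Hardware", "Bedienung"],
    6: ["Handbuch", "1260", "Software"],
    7: ["Handbuch", "Seite", "1260", "HP-Modus",
        "Inhaltsverzeichnis", "Informationen"],
    8: ["Inhaltsverzeichnis", "Hinweis", "Seiten", "Kapitel",
        "Diskette", "Frechheit"],
    9: ["Kapitel", "Ausdruck", "Hinweis", "1260",
        "Auto-Continue-Funktion", "PostScript-Emulation"],
    10: ["Auto-Continue-Funktion", "1260", "LC-Display"],
}
```

On this data, lift(3, 6, CF) climbs one level and lift(5, 11, CF) climbs two, which agrees with the behavior described for U6 and U11 in the sample segmentation of Section 4.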



Whenever a discourse segment is created, its starting
and closing utterances are initialized to the current
position in the discourse. Its end point gets continuously
incremented as the analysis proceeds until this discourse
segment DS is ultimately closed, i.e., whenever another
segment DS' exists at the same or a hierarchically higher
level of embedding such that the end point of DS' exceeds
the end point of DS. Closed segments are inaccessible for
the antecedent search. In Table 8, e.g., the first two
discourse segments at level 3 (ranging from U5 to U5 and
U8 to U11) are closed, while those at level 1 (ranging from
U1 to U3), level 2 (ranging from U4 to U7) and level 3
(ranging from U12 to U13) are open.
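
The closing criterion can be stated as a small predicate over the set of segments. The Python sketch below is an illustration under assumptions: the Segment record and the name is_closed are hypothetical, a smaller level number is taken to mean a hierarchically higher level, and the segment ranges are those read off the sample analysis in Table 8.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    level: int   # depth of embedding; 1 is the outermost level
    beg: int     # linear index of the opening utterance
    end: int     # linear index of the end point


def is_closed(seg: Segment, segments: List[Segment]) -> bool:
    """A segment is closed if another segment at the same or a
    hierarchically higher level ends later than it does."""
    return any(other != seg
               and other.level <= seg.level
               and other.end > seg.end
               for other in segments)


SEGMENTS = [
    Segment(1, 1, 3),
    Segment(2, 4, 7),
    Segment(3, 5, 5),
    Segment(3, 8, 11),
    Segment(3, 12, 13),
]
```

Applied to SEGMENTS, the predicate marks the level-3 segments U5 to U5 and U8 to U11 as closed and leaves the other three open.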

The main algorithm (see Table 6) consists of three major
logical blocks (s and U_i denote the current discourse
segment level and utterance, respectively).

1. Continue Current Segment. The Cp(s, U_{i-1}) is
taken over for U_i. If U_{i-1} and U_i indicate the end
of a sequence in which a series of thematizations of
rhemes has occurred, all embedded segments are
lifted by the function Lift to a higher level s'. As a
result of lifting, the entire sequence (including the
final two utterances) forms a single segment. This
is trivially true for cases of a constant theme.

2. Close Embedded Segment(s).

(a) Close the embedded segment(s) and continue
another, already existing segment: If U_i does
not include any anaphoric expression which
co-specifies an element of the Cf(s, U_{i-1}), then
match the antecedent in the hierarchically reachable
segments. Only the Cp of the utterance at the end
point of any of these segments is considered
a potential antecedent. Note that, as a side
effect, hierarchically lower segments are ultimately
closed when a match at higher segment levels succeeds.

(b) Close the embedded segment and open a new,
parallel one: If none of the anaphoric expressions
under consideration co-specify the
Cp(s-1, U_[s-1.end]), then the entire Cf at
this segment level is checked for the given utterance.
If an antecedent matches, the segment
which contains U_{i-1} is ultimately closed, since
U_i opens a parallel segment at the same level of
embedding. Subsequent anaphora checks exclude
any of the preceding parallel segments
from the search for a valid antecedent and just
visit the currently open one.

(c) Open new, embedded segment: If there is no
matching antecedent in hierarchically reachable
segments, then for utterance U_i a new, embedded
segment is opened.

3. Open New, Embedded Segment. If none of the 
above cases applies, then for utterance Ui a new, 
embedded segment is opened. In the course of pro- 
cessing the following utterances, this decision may 
be retracted by the function Lift. It serves as a kind 
of "garbage collector" for globally insignificant dis- 
course segments which, nevertheless, were reason- 
able from a local perspective for reference resolu- 
tion purposes. Hence, the centered discourse seg- 
mentation procedure works in an incremental way 
and revises only locally relevant, yet globally irrel- 
evant segmentation decisions on the fly. 

s := 1
i := 1
DS[s.beg] := i
DS[s.end] := i
while ¬ end of text
    i := i + 1
    R := {Resolved(x, s, U_i) | x ∈ U_i}
    if ∃r ∈ R : r =str Cp(s, U_{i-1})                          (1)
    then s' := s
         i' := i
         DS[Lift(s', i').end] := i
    else if ¬∃r ∈ R : r ∈ Cf(s, U_{i-1})                       (2a)
    then found := FALSE
         k := s
         while ¬found ∧ (k > 1)
             k := k - 1
             if ∃r ∈ R : r =str Cp(k, U_[k.end])
             then s := k
                  DS[s.end] := i
                  found := TRUE
             else if k = s - 1                                  (2b)
             then if ∃r ∈ R : r ∈ Cf(k, U_[k.end])
                  then DS[s.beg] := i
                       DS[s.end] := i
                       found := TRUE
         if ¬found                                              (2c)
         then s := s + 1
              DS[s.beg] := i
              DS[s.end] := i
    else s := s + 1                                             (3)
         DS[s.beg] := i
         DS[s.end] := i

Table 6: Algorithm for Centered Segmentation 
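
To make the control flow of Table 6 concrete, the following Python sketch runs the three blocks over the centering data of the sample text (Table 8). It is a simplified illustration and rests on several assumptions that are not part of the original specification: anaphora resolution is replaced by a crude stand-in (previously_mentioned treats every Cf element that already occurred in an earlier Cf list as a resolved antecedent), the current level s is set to the level returned by Lift in block (1), only segment end points are tracked (DS[s.beg] is omitted), and the block labels in the trace are added for inspection. All function and variable names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Cf = Dict[int, List[str]]   # utterance index -> ranked Cf list, Cp first


def lift(s: int, i: int, cf: Cf) -> int:
    """Table 5, with centers looked up by utterance index only."""
    if (s > 2 and i > 3 and cf[i - 1][0] != cf[i - 2][0]
            and cf[i - 2][0] != cf[i - 3][0] and cf[i - 1][0] in cf[i - 2]):
        return lift(s - 1, i - 1, cf)
    return s


def previously_mentioned(cf: Cf) -> Dict[int, Set[str]]:
    """Stand-in for Resolved: Cf elements seen in an earlier utterance."""
    seen: Set[str] = set()
    out: Dict[int, Set[str]] = {}
    for i in sorted(cf):
        out[i] = {x for x in cf[i] if x in seen}
        seen.update(cf[i])
    return out


def segment(cf: Cf, resolved: Dict[int, Set[str]]) -> List[Tuple[int, int, str]]:
    """Return (utterance index, segment level, block) for U_2 onwards."""
    s, end = 1, {1: 1}              # end[level] plays the role of DS[level.end]
    trace = []
    for i in range(2, max(cf) + 1):
        r, prev = resolved.get(i, set()), cf[i - 1]
        if prev[0] in r:                                    # block (1)
            s = lift(s, i, cf)      # assumption: s follows the lifted level
            end[s] = i
            block = "1"
        elif not any(x in prev for x in r):                 # block (2a)
            found, k = False, s
            while not found and k > 1:
                k -= 1
                if cf[end[k]][0] in r:                      # continue level k
                    s, found, block = k, True, "2a"
                    end[s] = i
                elif k == s - 1 and any(x in cf[end[k]] for x in r):
                    end[s] = i                              # block (2b)
                    found, block = True, "2b"
            if not found:                                   # block (2c)
                s += 1
                end[s] = i
                block = "2c"
        else:                                               # block (3)
            s += 1
            end[s] = i
            block = "3"
        trace.append((i, s, block))
    return trace


CF: Cf = {
    1: ["1260"],
    2: ["1260", "Umgang", "Detail"],
    3: ["1260", "Betrieb", "Arbeitsgeräusch", "Stand-by-Modus"],
    4: ["Standard-Installation", "Handbuch"],
    5: ["Handbuch", "1260", "Hardware", "Bedienung"],
    6: ["Handbuch", "1260", "Software"],
    7: ["Handbuch", "Seite", "1260", "HP-Modus", "Inhaltsverzeichnis", "Informationen"],
    8: ["Inhaltsverzeichnis", "Hinweis", "Seiten", "Kapitel", "Diskette", "Frechheit"],
    9: ["Kapitel", "Ausdruck", "Hinweis", "1260", "Auto-Continue-Funktion",
        "PostScript-Emulation"],
    10: ["Auto-Continue-Funktion", "1260", "LC-Display"],
    11: ["Auto-Continue-Funktion", "Dateien-Transfer", "Computer"],
    12: ["1260", "Macken", "Ausdruck"],
    13: ["1260", "Grauflächen"],
}
TRACE = segment(CF, previously_mentioned(CF))
```

Under these assumptions TRACE lists, for each utterance from U2 on, the segment level after processing and the block that fired. The stand-in for anaphora resolution happens to suffice for this sample; a faithful implementation would evaluate Resolved with the IsReachable and IsAnaphorFor predicates instead.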



4 A Sample Text Segmentation 

The text with respect to which we demonstrate the work- 
ing of the algorithm (see Table 7) is taken from a German 
computer magazine (c't, 1995, No.4, p.209). For ease 
of presentation the text is somewhat shortened. Since 
the method for computing levels of discourse segments 
depends heavily on different kinds of anaphoric expres- 
sions, (pro)nominal anaphors and textual ellipses are 
marked by italics, and the (pro)nominal anaphors are un- 
derlined, in addition. In order to convey the influence of 
the German word order we provide a rough phrase-to- 
phrase translation of the entire text. 

The centered segmentation analysis of the sample text 
is given in Table 8. The first column shows the linear text 
index of each utterance. The second column contains 
the centering data as computed by functional centering 
(Strube & Hahn, 1996). The first element of the Cf, the 
preferred center, Cp, is marked by bold font. The third 
column lists the centering transitions which are derived
from the Cb/Cf data of immediately successive utterances
(cf. Table 1 for the definitions). The fourth column
depicts the levels of discourse segments which are computed
by the algorithm in Table 6. Horizontal lines indicate
the beginning of a segment (in the algorithm, this
corresponds to a value assignment to DS[s.beg]). Vertical
lines show the extension of a segment (its end is fixed
by an assignment to DS[s.end]). The fifth column indicates
which block of the algorithm applies to the current
utterance (cf. the right margin in Table 6).

The computation starts at U1, the headline. The
Cf(U1) is set to "1260", which is meant as an abbreviation
of "Brother HL-1260". Upon initialization, the
beginning as well as the ending of the initial discourse
segment are both set to "1". U2 and U3 simply continue
this segment (block (1) of the algorithm), so Lift
does not apply. The Cp is set to "1260" in all utterances
of this segment. Since U4 contains neither an
anaphoric expression which co-specifies the Cp(1, U3)
(block (1)) nor one which co-specifies any other element
of the Cf(1, U3) (block (2a)), and as there is no
hierarchically preceding segment, block (2c) applies. The
segment counter s is incremented and a new segment at
level 2 is opened, setting the beginning and the ending
to "4". The phrase "das dünne Handbüchlein" (the thin
leaflet) in U5 does not co-specify the Cp(2, U4) but
co-specifies an element of the Cf(2, U4) instead (viz.
"Handbuch" (manual)). Hence, block (3) of the algorithm
applies, leading to the creation of a new segment at
level 3. The anaphor "Handbuch" (manual) in U6
co-specifies the Cp(3, U5). Hence block (1) applies (the
occurrence of "1260" in Cf(U5) is due to the assumptions
specified by Strube & Hahn (1996)). Given this
configuration, the function Lift lifts the embedded
segment one level, so the


(1) Brother HL-1260

(2) Ein Detail fällt schon beim ersten Umgang mit dem
großen Brother auf:
One particular - is already noticed - in the first approach
to - the big Brother.

(3) Im Betrieb macht er durch ein kräftiges Arbeitsgeräusch
auf sich aufmerksam, das auch im Stand-by-Modus noch
gut vernehmbar ist.
In operation - draws - it - with a heavy noise level -
attention to itself - which - also - in the stand-by mode -
is still well audible.

(4) Für Standard-Installationen kommt man gut ohne
Handbuch aus.
As far as standard installations are concerned - gets - one
- well - by - without any manual.

(5) Zwar erläutert das dünne Handbüchlein die Bedienung
der Hardware anschaulich und gut illustriert.
Admittedly, gives - the thin leaflet - the operation of the
hardware - a clear description of - and - well illustrated.

(6) Die Software-Seite wurde im Handbuch dagegen
stiefmütterlich behandelt:
The software part - was - in the manual - however - like
a stepmother - treated:

(7) bis auf eine karge Seite mit einem Inhaltsverzeichnis zum
HP-Modus sucht man vergebens weitere Informationen.
except for one meagre page - containing the table of
contents for the HP mode - seeks - one - in vain - for further
information.

(8) Kein Wunder: unter dem Inhaltsverzeichnis steht der
lapidare Hinweis, man möge sich die Seiten dieses Kapitels
doch bitte von Diskette ausdrucken - Frechheit.
No wonder: beneath the table of contents - one finds the
terse instruction, one should - oneself - the pages of this
section - please - from disk - print out - impertinence.

(9) Ohne diesen Ausdruck sucht man vergebens nach einem
Hinweis darauf, warum die Auto-Continue-Funktion in
der PostScript-Emulation nicht wirkt.
Without this print-out, looks - one - in vain - for a hint -
why - the auto-continue-function - in the PostScript
emulation - does not work.

(10) Nach dem Einschalten zeigt das LC-Display an, daß diese
praktische Hilfsfunktion nicht aktiv ist;
After switching on - depicts - the LC display - that - this
practical help function - not active - is;

(11) sie überwacht den Dateientransfer vom Computer.
it monitors the file transfer from the computer.

(12) Viele der kleinen Macken verzeiht man dem HL-1260
wenn man erste Ausdrucke in Händen hält.
Many of the minor defects - pardons - one - the
HL-1260, when - one - the first print outs - holds in
[one's] hands.

(13) Gerasterte Grauflächen erzeugt der Brother sehr homogen
Raster-mode grey-scale areas - generates - the Brother -
very homogeneously ...

Table 7: Sample Text



segment which ended with U4 is now continued up to
U6 at level 2. As a consequence, the centering data of
U5 are excluded from further consideration as far as the
co-specification by any subsequent anaphoric expression
is concerned. U7 simply continues the same segment,
since the textual ellipsis "Seite" (page) refers to
"Handbuch" (manual). The utterances U8 to U10 exhibit a
typical thematization-of-the-rhemes pattern which is quite
common for the detailed description of objects. (Note
the sequence of SHIFT transitions.) Hence, block (3)
of the algorithm applies to each of the utterances and,
correspondingly, new segments at the levels 3 to 5 are
created. This behavior breaks down at the occurrence
of the anaphoric expression "sie" (it) in U11 which
co-specifies the Cp(5, U10), viz. "auto-continue function",
denoted by another anaphoric expression, namely
"Hilfsfunktion" (help function) in U10. Hence, block (1)
applies. The evaluation of Lift succeeds with respect to
two levels of embedding. The whole sequence is thus lifted
up to level 3, continues the segment which started at the
discourse element "Inhaltsverzeichnis" (table of contents),
and is captured in one segment. U12 does not contain any
anaphoric expression which co-specifies an element of the
Cf(3, U11), hence block (2) of the algorithm applies. The
anaphor "HL-1260" does not co-specify the Cp of the
utterance which represents the end of the hierarchically
preceding discourse segment (U7), but it co-specifies an
element of the Cf(2, U7). The immediately preceding segment
is ultimately closed and a parallel segment is opened at
U12 (cf. block (2b)). Note also that the algorithm does not
check the Cf(3, U10) despite the fact that it contains the
antecedent of "1260". However, the occurrences of "1260" in
the Cfs of U9 and U10 are mediated by textual ellipses. If
these utterances contained the expression "1260" itself, the
algorithm would have built a different discourse structure
and, therefore, "1260" in U10 would be reachable for the
anaphor in U12. Segment 3, finally, is continued by U13.



U     Centering Data                                       Trans.  Level   Block

(1)   Cb: -                                                -       1       -
      Cf: [1260]
(2)   Cb: 1260                                             C       1       1
      Cf: [1260, Umgang, Detail]
(3)   Cb: 1260                                             C       1       1
      Cf: [1260, Betrieb, Arbeitsgeräusch,
           Stand-by-Modus]
(4)   Cb: -                                                -       2       2c
      Cf: [Standard-Installation, Handbuch]
(5)   Cb: Handbuch                                         C       3 -> 2  3
      Cf: [Handbuch, 1260, Hardware, Bedienung]
(6)   Cb: Handbuch                                         C       2       1, Lift
      Cf: [Handbuch, 1260, Software]
(7)   Cb: Handbuch                                         C       2       1
      Cf: [Handbuch, Seite, 1260, HP-Modus,
           Inhaltsverzeichnis, Informationen]
(8)   Cb: Inhaltsverzeichnis                               SS      3       3
      Cf: [Inhaltsverzeichnis, Hinweis, Seiten, Kapitel,
           Diskette, Frechheit]
(9)   Cb: Kapitel                                          SS      4 -> 3  3
      Cf: [Kapitel, Ausdruck, Hinweis, 1260,
           Auto-Continue-Funktion, PostScript-Emulation]
(10)  Cb: 1260                                             RS      5 -> 3  3
      Cf: [Auto-Continue-Funktion, 1260, LC-Display]
(11)  Cb: Auto-Continue-Funktion                           SS      3       1, Lift
      Cf: [Auto-Continue-Funktion, Dateien-Transfer,
           Computer]
(12)  Cb: -                                                -       3       2b
      Cf: [1260, Macken, Ausdruck]
(13)  Cb: 1260                                             C       3       1
      Cf: [1260, Grauflächen]

(a -> b: the segment opened at level a is lifted to level b
by a later application of Lift)

Table 8: Sample of a Centered Text Segmentation Analysis

5 Empirical Evaluation

In this section, we present some empirical data concerning
the centered segmentation algorithm. Our study was
based on the analysis of twelve texts from the information
technology domain (IT), of one text from a German



news magazine (Spiegel)^3, and of two literary texts^4
(Lit). Table 9 summarizes the total numbers of anaphors,
textual ellipses, utterances, and words in the test set.

^3 Japan - Der Neue der alten Garde. In Der Spiegel, Nr. 3,
1996.

^4 The first two chapters of a short story by the German
writer Heiner Müller (Liebesgeschichte. In Heiner Müller.
Geschichten aus der Produktion 2. Berlin: Rotbuch Verlag,
1974, pp. 57-63) and the first chapter of a novel by Uwe Johnson
(Zwei Ansichten. Frankfurt/Main: Suhrkamp Verlag, 1965.)

              IT      Spiegel   Lit     Σ
anaphors      197     101       198     496
ellipses      195     22        23      240
utterances    336     84        127     547
words         5241    1468      1610    8319

Table 9: Test Set

Table 10 and Table 11 consider the number of
anaphoric and text-elliptical expressions, respectively,
and the linear distance they have to their corresponding
antecedents. Note that common centering algorithms
(e.g., the one by Brennan et al. (1987)) are specified
only for the resolution of anaphors whose antecedents are
in U_{i-1}. They are neither specified for anaphoric
antecedents in U_i, not an issue here, nor for anaphoric
antecedents beyond U_{i-1}. In the test set, 139 anaphors
(28%) and 116 textual ellipses (48.3%) fall out of the
(intersentential) scope of those common algorithms. So,
the problem we consider is not a marginal one.
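
The proportions quoted for expressions outside the scope of common centering algorithms follow directly from the totals in Table 9. The short Python sketch below recomputes them and checks the row sums; the dictionary layout and the helper names are illustrative choices made here.

```python
# Counts as given in Table 9 (IT, Spiegel, Lit, and the reported total).
TEST_SET = {
    "anaphors":   {"IT": 197,  "Spiegel": 101,  "Lit": 198,  "total": 496},
    "ellipses":   {"IT": 195,  "Spiegel": 22,   "Lit": 23,   "total": 240},
    "utterances": {"IT": 336,  "Spiegel": 84,   "Lit": 127,  "total": 547},
    "words":      {"IT": 5241, "Spiegel": 1468, "Lit": 1610, "total": 8319},
}


def row_sums_ok(table: dict) -> bool:
    """Check that each reported total equals the sum of its three parts."""
    return all(
        sum(v for k, v in row.items() if k != "total") == row["total"]
        for row in table.values()
    )


def share(part: int, whole: int) -> float:
    """Percentage of part in whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Expressions whose antecedents lie beyond U_{i-1}, as stated in the text.
ANAPHORS_OUT_OF_SCOPE = share(139, TEST_SET["anaphors"]["total"])
ELLIPSES_OUT_OF_SCOPE = share(116, TEST_SET["ellipses"]["total"])
```

The recomputed shares agree with the 28% and 48.3% reported for anaphors and textual ellipses, respectively.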
Table 10: Anaphoric Antecedent in Utterance U_x

Table 11: Elliptical Antecedent in Utterance U_x

Table 12 and Table 13 give the success rate of the
centered segmentation algorithm for anaphors and textual
ellipses, respectively. The numbers in these tables
indicate at which segment level anaphors and textual
ellipses were correctly resolved. The category of errors
covers erroneous analyses the algorithm produces, while
the one for false positives concerns those resolution
results where a referential expression was resolved with
the hierarchically most recent antecedent but not with the
linearly most recent (obviously, the targeted) one (both of
them denote the same discourse entity). The categories
Cf(s, U_{i-1}) in Tables 12 and 13 contain more elements
than the categories U_{i-1} in Tables 10 and 11,
respectively, due to the mediating property of textual
ellipses in functional centering (Strube & Hahn, 1996).

Table 12: Anaphoric Antecedent in Center

Table 13: Elliptical Antecedent in Center



The centered segmentation algorithm performs well on the
test set. This is to some extent implied by the structural
patterns we find in expository texts, viz. their
single-theme property (e.g., "1260" in the sample text). In
contrast, the literary texts in the test set exhibited
a much more difficult internal structure which resembled
the multiple thread structure of dialogues discussed
by Rosé et al. (1995). The good news is that the
segmentation procedure we propose is capable of dealing
even with these more complicated structures. While only
one antecedent of a pronoun was not reachable given the
superimposed text structure, the remaining eight errors
are characterized by full definite noun phrases or proper
names. The vast majority of these phenomena can be
considered informationally redundant utterances in the
terminology of Walker (1996b), for which we currently
have no solution at all. It seems to us that these kinds
of phrases may override text-grammatical structures as
evidenced by referential discourse segments and, rather,
trigger other kinds of search strategies.

Though we fed the centered segmentation algorithm 
with rather long texts (up to 84 utterances), the an- 
tecedents of only two anaphoric expressions had to 
bridge a hierarchical distance of more than 3 levels. This 
coincides with our supposition that the overall structure 
computed by the algorithm should be rather flat. We 
could not find an embedding of more than seven levels. 

6 Related Work 

There has always been an implicit relationship between 
the local perspective of centering and the global view 
of focusing on discourse structure (cf. the discussion in 
Grosz et al. (1995)). However, work establishing an ex- 
plicit account of how both can be joined in a computa- 
tional model has not been done so far. The efforts of 
Sidner (1983), e.g., have provided a variety of different 
focus data structures to be used for reference resolution. 
This multiplicity and the on-going growth of the number
of different entities (cf. Suri & McCoy (1994)) mirror
an increase in explanatory constructs that we consider a
methodological drawback to this approach because they
can hardly be kept under control. Our model, due to its
hierarchical nature, implements a stack behavior that is
also inherent to the above mentioned proposals. We refrain,
however, from establishing a new data type (even worse,
different types of stacks) that has to be managed on its
own. There is no need for extra computations to determine
the "segment focus", since that is implicitly given
in the local centering data already available in our model.

A recent attempt at introducing global discourse no- 
tions into the centering framework considers the use of a 
cache model (Walker, 1996b). This introduces an addi- 
tional data type with its own management principles for 
data storage, retrieval and update. While our proposal 
for centered discourse segmentation also requires a data 
structure of its own, it is better integrated into centering 
than the caching model, since the cells of segment struc- 
tures simply contain "pointers" that implement a direct 
link to the original centering data. Hence, we avoid ex- 
tra operations related to feeding and updating the cache. 
The relation between our centered segmentation algorithm
and Walker's (1996a) integration of centering into
the cache model can be viewed from two different angles.
On the one hand, centered segmentation may be a part
of the cache model, since it provides an elaborate,
non-linear ordering of the elements within the cache. Note,
however, that our model does not require any prefixed
size corresponding to the limited attention constraint. On
the other hand, centered segmentation may replace the
cache model entirely, since both are competing models
of the attentional state. Centered segmentation also has
the additional advantage of restricting the search space of
anaphoric antecedents to those discourse entities actually
referred to in the discourse, while the cache model allows
unrestricted retrieval in the main or long-term memory.

Text segmentation procedures (more with an informa- 
tion retrieval motivation, rather than being related to ref- 
erence resolution tasks) have also been proposed for a 
coarse-grained partitioning of texts into contiguous, non- 
overlapping blocks and assigning content labels to these 
blocks (Hearst, 1994). The methodological basis of these 
studies are lexical cohesion indicators (Morris & Hirst, 
1991) combined with word-level co-occurrence statis- 
tics. Since the labelling is one-dimensional, this approxi- 
mates our use of preferred centers of discourse segments. 
These studies, however, lack the fine-grained information
of the contents of Cf lists also needed for proper
reference resolution.

Finally, many studies on discourse segmentation high- 
light the role of cue words for signaling segment bound- 
aries (cf., e.g., the discussion in Passonneau & Litman 
(1993)). However useful this strategy might be, we see 
the danger that such a surface-level description may actu- 
ally hide structural regularities at deeper levels of inves- 
tigation illustrated by access mechanisms for centering 
data at different levels of discourse segmentation. 

7 Conclusions 

We have developed a proposal for extending the centering
model to incorporate the global referential structure
of discourse for reference resolution. The hierarchy
of discourse segments we compute realizes certain
constraints on the reachability of antecedents. Moreover, we
claim that the hierarchy of discourse segments
implements an intuitive notion of the limited attention
constraint, as we avoid a simplistic, cognitively
implausible linear backward search for potential discourse
referents. Since we operate within a functional framework,
this study also presents one of the rare formal accounts of
thematic progression patterns for full-fledged texts which
were informally introduced by Daneš (1974).

The model, nevertheless, still has several restrictions. 
First, it has been developed on the basis of a small corpus 
of written texts. Though these cover diverse text sorts 
(viz. technical product reviews, newspaper articles and 
literary narratives), we currently do not account for
spoken monologues as modelled, e.g., by Passonneau & Litman
(1993) or even the intricacies of dyadic conversations
that Rosé et al. (1995) deal with. Second, a thorough
integration of the referential and intentional description 
of discourse segments still has to be worked out. 